Problem Statement¶

Business Context¶

Renewable energy sources play an increasingly important role in the global energy mix, as the effort to reduce the environmental impact of energy production increases.

Out of all the renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.

Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable and if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.

The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, break, etc.).

Objective¶

“ReneWind” is a company working on improving the machinery/processes involved in the production of wind energy using machine learning and has collected data of generator failure of wind turbines using sensors. They have shared a ciphered version of the data, as the data collected through sensors is confidential (the type of data collected varies with companies). Data has 40 predictors, 20000 observations in the training set and 5000 in the test set.

The objective is to build various classification models, tune them, and find the best one that will help identify failures so that the generators could be repaired before failing/breaking to reduce the overall maintenance cost. The nature of predictions made by the classification model will translate as follows:

  • True positives (TP) are failures correctly predicted by the model. These will result in repairing costs.
  • False negatives (FN) are real failures where there is no detection by the model. These will result in replacement costs.
  • False positives (FP) are detections where there is no failure. These will result in inspection costs.

It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.

“1” in the target variables should be considered as “failure” and “0” represents “No failure”.

Data Description¶

  • The data provided is a transformed version of original data which was collected using sensors.
  • Train.csv - To be used for training and tuning of models.
  • Test.csv - To be used only for testing the performance of the final best model.
  • Both the datasets consist of 40 predictor variables and 1 target variable

Importing necessary libraries¶

In [1]:
import pandas as pd
import numpy as np

#for visualizations
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

#for missing value imputation
from sklearn.impute import SimpleImputer

#for model buidling
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    RandomForestClassifier,
    BaggingClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier
)
from xgboost import XGBClassifier

from sklearn import metrics
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import (
    accuracy_score, recall_score, f1_score, precision_score, confusion_matrix, roc_auc_score
)
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

#for model tuning
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

#for pipelines
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)

# To supress scientific notations for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To supress warnings
import warnings

warnings.filterwarnings("ignore")

Loading the dataset¶

In [2]:
#link google drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [3]:
data = pd.read_csv('/content/drive/MyDrive/Project ReneWind/Train.csv.csv')
df = data.copy()

df.head()
Out[3]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
0 -4.465 -4.679 3.102 0.506 -0.221 -2.033 -2.911 0.051 -1.522 3.762 -5.715 0.736 0.981 1.418 -3.376 -3.047 0.306 2.914 2.270 4.395 -2.388 0.646 -1.191 3.133 0.665 -2.511 -0.037 0.726 -3.982 -1.073 1.667 3.060 -1.690 2.846 2.235 6.667 0.444 -2.369 2.951 -3.480 0
1 3.366 3.653 0.910 -1.368 0.332 2.359 0.733 -4.332 0.566 -0.101 1.914 -0.951 -1.255 -2.707 0.193 -4.769 -2.205 0.908 0.757 -5.834 -3.065 1.597 -1.757 1.766 -0.267 3.625 1.500 -0.586 0.783 -0.201 0.025 -1.795 3.033 -2.468 1.895 -2.298 -1.731 5.909 -0.386 0.616 0
2 -3.832 -5.824 0.634 -2.419 -1.774 1.017 -2.099 -3.173 -2.082 5.393 -0.771 1.107 1.144 0.943 -3.164 -4.248 -4.039 3.689 3.311 1.059 -2.143 1.650 -1.661 1.680 -0.451 -4.551 3.739 1.134 -2.034 0.841 -1.600 -0.257 0.804 4.086 2.292 5.361 0.352 2.940 3.839 -4.309 0
3 1.618 1.888 7.046 -1.147 0.083 -1.530 0.207 -2.494 0.345 2.119 -3.053 0.460 2.705 -0.636 -0.454 -3.174 -3.404 -1.282 1.582 -1.952 -3.517 -1.206 -5.628 -1.818 2.124 5.295 4.748 -2.309 -3.963 -6.029 4.949 -3.584 -2.577 1.364 0.623 5.550 -1.527 0.139 3.101 -1.277 0
4 -0.111 3.872 -3.758 -2.983 3.793 0.545 0.205 4.849 -1.855 -6.220 1.998 4.724 0.709 -1.989 -2.633 4.184 2.245 3.734 -6.313 -5.380 -0.887 2.062 9.446 4.490 -3.945 4.582 -8.780 -3.383 5.107 6.788 2.044 8.266 6.629 -10.069 1.223 -3.230 1.687 -2.164 -3.645 6.510 0
In [5]:
test_data = pd.read_csv('/content/drive/MyDrive/Project ReneWind/Test.csv.csv')
test_data.head()
Out[5]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
0 -0.613 -3.820 2.202 1.300 -1.185 -4.496 -1.836 4.723 1.206 -0.342 -5.123 1.017 4.819 3.269 -2.984 1.387 2.032 -0.512 -1.023 7.339 -2.242 0.155 2.054 -2.772 1.851 -1.789 -0.277 -1.255 -3.833 -1.505 1.587 2.291 -5.411 0.870 0.574 4.157 1.428 -10.511 0.455 -1.448 0
1 0.390 -0.512 0.527 -2.577 -1.017 2.235 -0.441 -4.406 -0.333 1.967 1.797 0.410 0.638 -1.390 -1.883 -5.018 -3.827 2.418 1.762 -3.242 -3.193 1.857 -1.708 0.633 -0.588 0.084 3.014 -0.182 0.224 0.865 -1.782 -2.475 2.494 0.315 2.059 0.684 -0.485 5.128 1.721 -1.488 0
2 -0.875 -0.641 4.084 -1.590 0.526 -1.958 -0.695 1.347 -1.732 0.466 -4.928 3.565 -0.449 -0.656 -0.167 -1.630 2.292 2.396 0.601 1.794 -2.120 0.482 -0.841 1.790 1.874 0.364 -0.169 -0.484 -2.119 -2.157 2.907 -1.319 -2.997 0.460 0.620 5.632 1.324 -1.752 1.808 1.676 0
3 0.238 1.459 4.015 2.534 1.197 -3.117 -0.924 0.269 1.322 0.702 -5.578 -0.851 2.591 0.767 -2.391 -2.342 0.572 -0.934 0.509 1.211 -3.260 0.105 -0.659 1.498 1.100 4.143 -0.248 -1.137 -5.356 -4.546 3.809 3.518 -3.074 -0.284 0.955 3.029 -1.367 -3.412 0.906 -2.451 0
4 5.828 2.768 -1.235 2.809 -1.642 -1.407 0.569 0.965 1.918 -2.775 -0.530 1.375 -0.651 -1.679 -0.379 -4.443 3.894 -0.608 2.945 0.367 -5.789 4.598 4.450 3.225 0.397 0.248 -2.362 1.079 -0.473 2.243 -3.591 1.774 -1.502 -2.227 4.777 -6.560 -0.806 -0.276 -3.858 -0.538 0

Data Overview¶

  • Observations
  • Sanity checks

Training Data¶

In [7]:
df.shape
Out[7]:
(20000, 41)

The training data set consist of 20000 rows and 41 columns.

In [8]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 41 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V1      19982 non-null  float64
 1   V2      19982 non-null  float64
 2   V3      20000 non-null  float64
 3   V4      20000 non-null  float64
 4   V5      20000 non-null  float64
 5   V6      20000 non-null  float64
 6   V7      20000 non-null  float64
 7   V8      20000 non-null  float64
 8   V9      20000 non-null  float64
 9   V10     20000 non-null  float64
 10  V11     20000 non-null  float64
 11  V12     20000 non-null  float64
 12  V13     20000 non-null  float64
 13  V14     20000 non-null  float64
 14  V15     20000 non-null  float64
 15  V16     20000 non-null  float64
 16  V17     20000 non-null  float64
 17  V18     20000 non-null  float64
 18  V19     20000 non-null  float64
 19  V20     20000 non-null  float64
 20  V21     20000 non-null  float64
 21  V22     20000 non-null  float64
 22  V23     20000 non-null  float64
 23  V24     20000 non-null  float64
 24  V25     20000 non-null  float64
 25  V26     20000 non-null  float64
 26  V27     20000 non-null  float64
 27  V28     20000 non-null  float64
 28  V29     20000 non-null  float64
 29  V30     20000 non-null  float64
 30  V31     20000 non-null  float64
 31  V32     20000 non-null  float64
 32  V33     20000 non-null  float64
 33  V34     20000 non-null  float64
 34  V35     20000 non-null  float64
 35  V36     20000 non-null  float64
 36  V37     20000 non-null  float64
 37  V38     20000 non-null  float64
 38  V39     20000 non-null  float64
 39  V40     20000 non-null  float64
 40  Target  20000 non-null  int64  
dtypes: float64(40), int64(1)
memory usage: 6.3 MB

There are 40 independent variables as type float. The target variable is a binary of type int with 1 meaning a failure and 0 meaning no failure.

In [9]:
df.describe(include='all').T
Out[9]:
count mean std min 25% 50% 75% max
V1 19982.000 -0.272 3.442 -11.876 -2.737 -0.748 1.840 15.493
V2 19982.000 0.440 3.151 -12.320 -1.641 0.472 2.544 13.089
V3 20000.000 2.485 3.389 -10.708 0.207 2.256 4.566 17.091
V4 20000.000 -0.083 3.432 -15.082 -2.348 -0.135 2.131 13.236
V5 20000.000 -0.054 2.105 -8.603 -1.536 -0.102 1.340 8.134
V6 20000.000 -0.995 2.041 -10.227 -2.347 -1.001 0.380 6.976
V7 20000.000 -0.879 1.762 -7.950 -2.031 -0.917 0.224 8.006
V8 20000.000 -0.548 3.296 -15.658 -2.643 -0.389 1.723 11.679
V9 20000.000 -0.017 2.161 -8.596 -1.495 -0.068 1.409 8.138
V10 20000.000 -0.013 2.193 -9.854 -1.411 0.101 1.477 8.108
V11 20000.000 -1.895 3.124 -14.832 -3.922 -1.921 0.119 11.826
V12 20000.000 1.605 2.930 -12.948 -0.397 1.508 3.571 15.081
V13 20000.000 1.580 2.875 -13.228 -0.224 1.637 3.460 15.420
V14 20000.000 -0.951 1.790 -7.739 -2.171 -0.957 0.271 5.671
V15 20000.000 -2.415 3.355 -16.417 -4.415 -2.383 -0.359 12.246
V16 20000.000 -2.925 4.222 -20.374 -5.634 -2.683 -0.095 13.583
V17 20000.000 -0.134 3.345 -14.091 -2.216 -0.015 2.069 16.756
V18 20000.000 1.189 2.592 -11.644 -0.404 0.883 2.572 13.180
V19 20000.000 1.182 3.397 -13.492 -1.050 1.279 3.493 13.238
V20 20000.000 0.024 3.669 -13.923 -2.433 0.033 2.512 16.052
V21 20000.000 -3.611 3.568 -17.956 -5.930 -3.533 -1.266 13.840
V22 20000.000 0.952 1.652 -10.122 -0.118 0.975 2.026 7.410
V23 20000.000 -0.366 4.032 -14.866 -3.099 -0.262 2.452 14.459
V24 20000.000 1.134 3.912 -16.387 -1.468 0.969 3.546 17.163
V25 20000.000 -0.002 2.017 -8.228 -1.365 0.025 1.397 8.223
V26 20000.000 1.874 3.435 -11.834 -0.338 1.951 4.130 16.836
V27 20000.000 -0.612 4.369 -14.905 -3.652 -0.885 2.189 17.560
V28 20000.000 -0.883 1.918 -9.269 -2.171 -0.891 0.376 6.528
V29 20000.000 -0.986 2.684 -12.579 -2.787 -1.176 0.630 10.722
V30 20000.000 -0.016 3.005 -14.796 -1.867 0.184 2.036 12.506
V31 20000.000 0.487 3.461 -13.723 -1.818 0.490 2.731 17.255
V32 20000.000 0.304 5.500 -19.877 -3.420 0.052 3.762 23.633
V33 20000.000 0.050 3.575 -16.898 -2.243 -0.066 2.255 16.692
V34 20000.000 -0.463 3.184 -17.985 -2.137 -0.255 1.437 14.358
V35 20000.000 2.230 2.937 -15.350 0.336 2.099 4.064 15.291
V36 20000.000 1.515 3.801 -14.833 -0.944 1.567 3.984 19.330
V37 20000.000 0.011 1.788 -5.478 -1.256 -0.128 1.176 7.467
V38 20000.000 -0.344 3.948 -17.375 -2.988 -0.317 2.279 15.290
V39 20000.000 0.891 1.753 -6.439 -0.272 0.919 2.058 7.760
V40 20000.000 -0.876 3.012 -11.024 -2.940 -0.921 1.120 10.654
Target 20000.000 0.056 0.229 0.000 0.000 0.000 0.000 1.000

All columns consist of numeric values with varying distributions, we will need to explore more because there is a lack of context for variable meaning due to confidentiality.

In [10]:
df.duplicated().sum()
Out[10]:
0

No duplicate data.

In [11]:
df.isnull().sum()
Out[11]:
V1        18
V2        18
V3         0
V4         0
V5         0
V6         0
V7         0
V8         0
V9         0
V10        0
V11        0
V12        0
V13        0
V14        0
V15        0
V16        0
V17        0
V18        0
V19        0
V20        0
V21        0
V22        0
V23        0
V24        0
V25        0
V26        0
V27        0
V28        0
V29        0
V30        0
V31        0
V32        0
V33        0
V34        0
V35        0
V36        0
V37        0
V38        0
V39        0
V40        0
Target     0
dtype: int64

There are 18 missing values in the first two columns. These will be treated.

Test Data¶

In [12]:
test_data.shape
Out[12]:
(5000, 41)

There are 5000 rows and 41 columns.

In [13]:
test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 41 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V1      4995 non-null   float64
 1   V2      4994 non-null   float64
 2   V3      5000 non-null   float64
 3   V4      5000 non-null   float64
 4   V5      5000 non-null   float64
 5   V6      5000 non-null   float64
 6   V7      5000 non-null   float64
 7   V8      5000 non-null   float64
 8   V9      5000 non-null   float64
 9   V10     5000 non-null   float64
 10  V11     5000 non-null   float64
 11  V12     5000 non-null   float64
 12  V13     5000 non-null   float64
 13  V14     5000 non-null   float64
 14  V15     5000 non-null   float64
 15  V16     5000 non-null   float64
 16  V17     5000 non-null   float64
 17  V18     5000 non-null   float64
 18  V19     5000 non-null   float64
 19  V20     5000 non-null   float64
 20  V21     5000 non-null   float64
 21  V22     5000 non-null   float64
 22  V23     5000 non-null   float64
 23  V24     5000 non-null   float64
 24  V25     5000 non-null   float64
 25  V26     5000 non-null   float64
 26  V27     5000 non-null   float64
 27  V28     5000 non-null   float64
 28  V29     5000 non-null   float64
 29  V30     5000 non-null   float64
 30  V31     5000 non-null   float64
 31  V32     5000 non-null   float64
 32  V33     5000 non-null   float64
 33  V34     5000 non-null   float64
 34  V35     5000 non-null   float64
 35  V36     5000 non-null   float64
 36  V37     5000 non-null   float64
 37  V38     5000 non-null   float64
 38  V39     5000 non-null   float64
 39  V40     5000 non-null   float64
 40  Target  5000 non-null   int64  
dtypes: float64(40), int64(1)
memory usage: 1.6 MB

As with the training data all predictor columns are float type and the target column is int type.

In [14]:
test_data.duplicated().sum()
Out[14]:
0

No duplicated data.

In [15]:
test_data.isnull().sum()
Out[15]:
V1        5
V2        6
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
V29       0
V30       0
V31       0
V32       0
V33       0
V34       0
V35       0
V36       0
V37       0
V38       0
V39       0
V40       0
Target    0
dtype: int64

There are a few missing values in the first two columns.

Exploratory Data Analysis (EDA)¶

Plotting histograms and boxplots for all the variables¶

In [16]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to the show density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

V1¶

In [17]:
round(df['V1'].median(),2)
Out[17]:
-0.75
In [18]:
histogram_boxplot(df, 'V1')

The distribution for V1 is slightly skewed to the right. There are observable outliers with most values being centered around -0.7.

V2¶

In [19]:
round(df['V2'].median(),2)
Out[19]:
0.47
In [20]:
histogram_boxplot(df, 'V2')

The distribution for V2 resembles a normal distribution with outliers on both sides. The data is centered around 0.5.

V3¶

In [21]:
round(df['V3'].median(),2)
Out[21]:
2.26
In [22]:
histogram_boxplot(df, 'V3')

The distribution for V3 resembles a normal distribution with outliers present on both sides. The values are centered around 2.3.

V4¶

In [23]:
round(df['V4'].median(),2)
Out[23]:
-0.14
In [24]:
histogram_boxplot(df, 'V4')

The distribution for V4 resembles a normal distribution with outliers present on both sides. The data is centered around -0.1.

V5¶

In [25]:
round(df['V5'].median(),2)
Out[25]:
-0.1
In [26]:
histogram_boxplot(df, 'V5')

The distribution of V5 resembles a normal distribution with outliers present on both sides. The data is centered around -0.1.

V6¶

In [27]:
round(df['V6'].median(),2)
Out[27]:
-1.0
In [28]:
histogram_boxplot(df, 'V6')

The distribution for V6 resembles a normal distribution with outliers present on both sides. The data is centered around -1.0.

V7¶

In [29]:
round(df['V7'].median(),2)
Out[29]:
-0.92
In [30]:
histogram_boxplot(df, 'V7')

The distribution of V7 resembles a normal distribution with outliers present on both sides. The data is centered around -0.9.

V8¶

In [31]:
round(df['V8'].median(),2)
Out[31]:
-0.39
In [32]:
histogram_boxplot(df, 'V8')

The distribution for V8 resembles a normal distribution with outliers present on both sides. The data is centered around -0.4.

V9¶

In [33]:
round(df['V9'].median(),2)
Out[33]:
-0.07
In [34]:
histogram_boxplot(df, 'V9')

The distribution for V9 resembles a normal distribution with outliers present on both sides. The data is centered around -0.1.

V10¶

In [35]:
round(df['V10'].median(),2)
Out[35]:
0.1
In [36]:
histogram_boxplot(df, 'V10')

The distribution of V10 resembles a normal distribution with outliers present on both sides. The data is centered around 0.1.

V11¶

In [37]:
round(df['V11'].median(),2)
Out[37]:
-1.92
In [38]:
histogram_boxplot(df, 'V11')

The distribution for V11 resembles a normal distribution with outliers present on both sides. The data is centered around -1.9.

V12¶

In [39]:
round(df['V12'].median(),2)
Out[39]:
1.51
In [40]:
histogram_boxplot(df, 'V12')

The distribution for V12 resembles a normal distribution with outliers present on both sides. The data is centered around 1.5.

V13¶

In [41]:
round(df['V13'].median(),2)
Out[41]:
1.64
In [42]:
histogram_boxplot(df, 'V13')

The distribution for V13 resembles a normal distribution with outliers present on both sides. The data is centered around1.6.

V14¶

In [43]:
round(df['V14'].median(),2)
Out[43]:
-0.96
In [44]:
histogram_boxplot(df, 'V14')

The distribution for V14 resembles a normal distribution with outliers present on both sides. The data is centered around-1.0.

V15¶

In [45]:
round(df['V15'].median(),2)
Out[45]:
-2.38
In [46]:
histogram_boxplot(df, 'V15')

The distribution for V15 resembles a normal distribution with outliers present on both sides. The data is centered around -2.4.

V16¶

In [47]:
round(df['V16'].median(),2)
Out[47]:
-2.68
In [48]:
histogram_boxplot(df, 'V16')

The distribution for V16 is slightly skewed but still resembles a normal distribution with outliers present. The data is centered around -2.7.

V17¶

In [49]:
round(df['V17'].median(),2)
Out[49]:
-0.01
In [50]:
histogram_boxplot(df, 'V17')

The distribution for V17 resembles a normal distribution with outliers present. The data is centered around 0.0.

V18¶

In [51]:
round(df['V18'].median(),2)
Out[51]:
0.88
In [52]:
histogram_boxplot(df, 'V18')

The distribution for V18 is slightly skewed to the right with outliers present. The data is centered around 0.9.

V19¶

In [53]:
round(df['V19'].median(),2)
Out[53]:
1.28
In [54]:
histogram_boxplot(df, 'V19')

The distribution for V19 resembles a normal distribution with outliers present on both sides. The data is centered around 1.3.

V20¶

In [55]:
round(df['V20'].median(),2)
Out[55]:
0.03
In [56]:
histogram_boxplot(df, 'V20')

The distribution for V20 resembles a normal distribution with outliers present on both sides. The data is centered around 0.0.

V21¶

In [57]:
round(df['V21'].median(),2)
Out[57]:
-3.53
In [58]:
histogram_boxplot(df, 'V21')

The distribution for V21 resembles a normal distribution with outliers present on both sides. The data is centered around -3.5.

V22¶

In [59]:
round(df['V22'].median(),2)
Out[59]:
0.97
In [60]:
histogram_boxplot(df, 'V22')

The distribution of V22 resembles a normal distribution with outliers present on both sides. The data is centered around 1.0.

V23¶

In [61]:
round(df['V23'].median(),2)
Out[61]:
-0.26
In [62]:
histogram_boxplot(df, 'V23')

The distribution of V23 resembles a normal distribution with outliers present on both sides. The data is centered around -0.3.

V24¶

In [63]:
round(df['V24'].median(),2)
Out[63]:
0.97
In [64]:
histogram_boxplot(df, 'V24')

The distribution of V24 resembles a normal distribution with outliers present on both sides. The data is centered around 1.0.

V25¶

In [65]:
round(df['V25'].median(),2)
Out[65]:
0.03
In [66]:
histogram_boxplot(df, 'V25')

The distribution of V25 resembles a normal distribution with outliers present. The data is centered around 0.0.

V26¶

In [67]:
round(df['V26'].median(),2)
Out[67]:
1.95
In [68]:
histogram_boxplot(df, 'V26')

The distribution of V26 resembles a normal distribution with outliers present on both sides. The data is centered around 2.0.

V27¶

In [69]:
round(df['V27'].median(),2)
Out[69]:
-0.88
In [70]:
histogram_boxplot(df, 'V27')

The distribution for V27 is slightly skewed to the right but still resmbles a normal distribution with outliers present on both sides. The data is centered around -0.9.

V28¶

In [71]:
round(df['V28'].median(),2)
Out[71]:
-0.89
In [72]:
histogram_boxplot(df, 'V28')

The distribution of V28 resembles a normal distribution with outliers on both sides. The data is centered around -0.9.

V29¶

In [73]:
round(df['V29'].median(),2)
Out[73]:
-1.18
In [74]:
histogram_boxplot(df, 'V29')

The distribution for V29 is slighlty skewed right with outliers present on both sides. The data is centered around -1.2.

V30¶

In [75]:
round(df['V30'].median(),2)
Out[75]:
0.18
In [76]:
histogram_boxplot(df, 'V30')

The distribution for V30 is slightly skewed left with outliers present on both sides. The data is centered around 0.2.

V31¶

In [77]:
round(df['V31'].median(),2)
Out[77]:
0.49
In [78]:
histogram_boxplot(df, 'V31')

The distribution of V31 resembles a normal distribution with outliers present on both sides. The data is centered around 0.5.

V32¶

In [79]:
round(df['V32'].median(),2)
Out[79]:
0.05
In [80]:
histogram_boxplot(df, 'V32')

The distribution for V32 is slightly skewed right but still resembles a normal distribution with outliers present on both sides. The data is centered around 0.1.

V33¶

In [81]:
round(df['V33'].median(),2)
Out[81]:
-0.07
In [82]:
histogram_boxplot(df, 'V33')

The distribution of V33 resembles a normal distribution with outliers present on both sides. The data is centered around -0.8.

V34¶

In [83]:
round(df['V34'].median(),2)
Out[83]:
-0.26
In [84]:
histogram_boxplot(df, 'V34')

The distribution for V34 is slightly skewed left but still resembles a normal distribution with outliers present on both sides. The data is centered around -0.3.

V35¶

In [85]:
round(df['V35'].median(),2)
Out[85]:
2.1
In [86]:
histogram_boxplot(df, 'V35')

The distribution for V35 is slightly skewed right but still resembles a normal distribution with outliers present on both sides. The data is centered around 2.1.

V36¶

In [87]:
round(df['V36'].median(),2)
Out[87]:
1.57
In [88]:
histogram_boxplot(df, 'V36')

The distribution for V36 resembles a normal distribution with outliers present on both sides. The data is centered around 1.6.

V37¶

In [89]:
round(df['V37'].median(),2)
Out[89]:
-0.13
In [90]:
histogram_boxplot(df, 'V37')

The distribution for V37 is slightly skewed right with outliers present on both sides. The data is centered around -0.1.

V38¶

In [91]:
round(df['V38'].median(),2)
Out[91]:
-0.32
In [92]:
histogram_boxplot(df, 'V38')

The distribution of V38 resembles a normal distribution with outliers present on both sides. The data is centered around -0.3.

V39¶

In [93]:
round(df['V39'].median(),2)
Out[93]:
0.92
In [94]:
histogram_boxplot(df, 'V39')

The distribution of V39 resembles a normal distribution with outliers present on both sides. The data is centered around 0.9.

V40¶

In [95]:
round(df['V40'].median(),2)
Out[95]:
-0.92
In [96]:
histogram_boxplot(df, 'V40')

The distribution of V40 resembles a normal distribution with outliers present on both sides. The data is centered around -0.9.

Target¶

In [97]:
df['Target'].value_counts()
Out[97]:
0    18890
1     1110
Name: Target, dtype: int64
In [98]:
print('The percentage of non falures: ', 18890/df['Target'].count())

print('The percentage of failures: ', 1110/df['Target'].count())
The percentage of non falures:  0.9445
The percentage of failures:  0.0555
In [99]:
sns.countplot(data = df, x = 'Target')
Out[99]:
<Axes: xlabel='Target', ylabel='count'>

The percentage of failures in the dataset is approximately 6% and the percentage of non-faliures is approximately 94%. There is clearly an imbalance.

Plotting all the features at one go¶

In [100]:
for feature in df.columns:
    histogram_boxplot(df, feature, figsize=(12, 7), kde=False, bins=None) ## Please change the dataframe name as you define while reading the data

Data Pre-processing¶

In [4]:
df1 = df.copy()
In [7]:
test_data1 = test_data.copy()
In [8]:
#Separate X and Y in train
X = df1.drop('Target', axis=1)
y = df1['Target']
In [92]:
#Separate X and Y for test data
X_test = test_data1.drop('Target', axis=1)
y_test = test_data1['Target']
In [9]:
#Split train csv into train and validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

#Print shape of train and validation 
print(X_train.shape, X_val.shape)
(16000, 40) (4000, 40)
In [10]:
print(y_train.value_counts() / y_train.count())
print("-" * 30)
print(y_val.value_counts() / y_val.count())
0   0.945
1   0.056
Name: Target, dtype: float64
------------------------------
0   0.945
1   0.056
Name: Target, dtype: float64

The ratio is the same for train and validation, ensuring no data leakage.

Missing value imputation¶

In [11]:
#Using imputer
imputer = SimpleImputer(strategy = 'median')
In [94]:
# Fit and transform the train data
X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)

# Transform the validation data
X_val = pd.DataFrame(imputer.transform(X_val), columns=X_train.columns)

#Impute nan from the test data
X_test = pd.DataFrame(imputer.fit_transform(X_test), columns=X_train.columns)
In [13]:
print(X_train.isna().sum())
print("-" * 30)
print(X_val.isna().sum())
V1     0
V2     0
V3     0
V4     0
V5     0
V6     0
V7     0
V8     0
V9     0
V10    0
V11    0
V12    0
V13    0
V14    0
V15    0
V16    0
V17    0
V18    0
V19    0
V20    0
V21    0
V22    0
V23    0
V24    0
V25    0
V26    0
V27    0
V28    0
V29    0
V30    0
V31    0
V32    0
V33    0
V34    0
V35    0
V36    0
V37    0
V38    0
V39    0
V40    0
dtype: int64
------------------------------
V1     0
V2     0
V3     0
V4     0
V5     0
V6     0
V7     0
V8     0
V9     0
V10    0
V11    0
V12    0
V13    0
V14    0
V15    0
V16    0
V17    0
V18    0
V19    0
V20    0
V21    0
V22    0
V23    0
V24    0
V25    0
V26    0
V27    0
V28    0
V29    0
V30    0
V31    0
V32    0
V33    0
V34    0
V35    0
V36    0
V37    0
V38    0
V39    0
V40    0
dtype: int64

Model Building¶

Model evaluation criterion¶

The nature of predictions made by the classification model will translate as follows:

  • True positives (TP) are failures correctly predicted by the model.
  • False negatives (FN) are real failures in a generator where there is no detection by model.
  • False positives (FP) are failure detections in a generator where there is no failure.

Which metric to optimize?

  • We need to choose the metric which will ensure that the maximum number of generator failures are predicted correctly by the model.
  • We would want Recall to be maximized as greater the Recall, the higher the chances of minimizing false negatives.
  • We want to minimize false negatives because if a model predicts that a machine will have no failure when there will be a failure, it will increase the maintenance cost.

Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.

In [5]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1
            
        },
        index=[0],
    )

    return df_perf
In [6]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    function for the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Defining scorer to be used for cross-validation and hyperparameter tuning¶

  • We want to reduce false negatives and will try to maximize "Recall".
  • To maximize Recall, we can use Recall as a scorer in cross-validation and hyperparameter tuning.
In [7]:
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

Model building with original data¶

In [112]:
%%time
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
models.append(("lr", LogisticRegression(random_state=1)))
models.append(("Random Forest", RandomForestClassifier(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("XGBoost", XGBClassifier(random_state=1)))

results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

dtree: 0.7196280073636767
lr: 0.48988129245223133
Random Forest: 0.7195899193804354
Bagging: 0.7083222243382213
AdaBoost: 0.6215641465117756
GBM: 0.7173363803719928
XGBoost: 0.810804291246112

Validation Performance:

dtree: 0.7387387387387387
lr: 0.49099099099099097
Random Forest: 0.7432432432432432
Bagging: 0.7207207207207207
AdaBoost: 0.6576576576576577
GBM: 0.7432432432432432
XGBoost: 0.8153153153153153
CPU times: user 7min 42s, sys: 965 ms, total: 7min 43s
Wall time: 6min 58s
In [113]:
#boxplot with cv scores to compare distribution of scores
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results1)
ax.set_xticklabels(names)

plt.show()

When comparing boxplots of CV-score for the models with the original data, we see that Decision tree, Random forest, Bagging, GradientBoost and XGBoost perform the best.

Model Building with Oversampled data¶

In [17]:
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
In [115]:
%%time
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
models.append(("lr", LogisticRegression(random_state=1)))
models.append(("Random Forest", RandomForestClassifier(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("XGBoost", XGBClassifier(random_state=1)))

results2 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset with oversampled data:" "\n")

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_over, y=y_train_over, scoring=scorer, cv=kfold
    )
    results2.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset with oversampled data:

dtree: 0.9732668119313808
lr: 0.8812865538044636
Random Forest: 0.9855744607906776
Bagging: 0.9781630048735123
AdaBoost: 0.8935280870047044
GBM: 0.9239674518302545
XGBoost: 0.9906035856141958

Validation Performance:

dtree: 0.7387387387387387
lr: 0.49099099099099097
Random Forest: 0.7432432432432432
Bagging: 0.7207207207207207
AdaBoost: 0.6576576576576577
GBM: 0.7432432432432432
XGBoost: 0.8153153153153153
CPU times: user 12min 31s, sys: 1.76 s, total: 12min 33s
Wall time: 11min
In [116]:
#boxplot with cv scores to compare distribution of scores
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results2)
ax.set_xticklabels(names)

plt.show()

When comparing boxplots of CV-score for the models with the over sampled data, we see that Decision tree, Random forest, Bagging, GradientBoost and XGBoost perform the best. With using oversampled data, we may get an overfit model, which is present in the metrics from training to validation. We can address the overfitting during tuning.

Model Building with Undersampled data¶

In [18]:
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
In [118]:
%%time
models = []  # Empty list to store all the models

# Appending models into the list
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
models.append(("lr", LogisticRegression(random_state=1)))
models.append(("Random Forest", RandomForestClassifier(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("XGBoost", XGBClassifier(random_state=1)))

results3 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print("\n" "Cross-Validation performance on training dataset:" "\n")

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_un, y=y_train_un, scoring=scorer, cv=kfold
    )
    results3.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean()))

print("\n" "Validation Performance:" "\n")

for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores))
Cross-Validation performance on training dataset:

dtree: 0.8468355233923697
lr: 0.8513235574176348
Random Forest: 0.8975052370976957
Bagging: 0.8704627689963816
AdaBoost: 0.8715927124992063
GBM: 0.8907446200723672
XGBoost: 0.8930108550752237

Validation Performance:

dtree: 0.7387387387387387
lr: 0.49099099099099097
Random Forest: 0.7432432432432432
Bagging: 0.7207207207207207
AdaBoost: 0.6576576576576577
GBM: 0.7432432432432432
XGBoost: 0.8153153153153153
CPU times: user 2min 5s, sys: 476 ms, total: 2min 5s
Wall time: 1min 52s
In [119]:
#boxplot with cv scores to compare distribution of scores
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results3)
ax.set_xticklabels(names)

plt.show()

When comparing boxplots of CV-score for the models with the under sampled data, we see that Random forest, Bagging, GradientBoost and XGBoost perform the best. When comparing the training set to the validation set, we can see a level of overfitting that can be addressed during tuning.

HyperparameterTuning¶

Sample Parameter Grids¶

Hyperparameter tuning can take a long time to run, so to avoid that time complexity - you can use the following grids, wherever required.

  • For Gradient Boosting:

param_grid = { "n_estimators": np.arange(100,150,25), "learning_rate": [0.2, 0.05, 1], "subsample":[0.5,0.7], "max_features":[0.5,0.7] }

  • For Adaboost:

param_grid = { "n_estimators": [100, 150, 200], "learning_rate": [0.2, 0.05], "base_estimator": [DecisionTreeClassifier(max_depth=1, random_state=1), DecisionTreeClassifier(max_depth=2, random_state=1), DecisionTreeClassifier(max_depth=3, random_state=1), ] }

  • For Bagging Classifier:

param_grid = { 'max_samples': [0.8,0.9,1], 'max_features': [0.7,0.8,0.9], 'n_estimators' : [30,50,70], }

  • For Random Forest:

param_grid = { "n_estimators": [200,250,300], "min_samples_leaf": np.arange(1, 4), "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'], "max_samples": np.arange(0.4, 0.7, 0.1) }

  • For Decision Trees:

param_grid = { 'max_depth': np.arange(2,6), 'min_samples_leaf': [1, 4, 7], 'max_leaf_nodes' : [10, 15], 'min_impurity_decrease': [0.0001,0.001] }

  • For Logistic Regression:

param_grid = {'C': np.arange(0.1,1.1,0.1)}

  • For XGBoost:

param_grid={ 'n_estimators': [150, 200, 250], 'scale_pos_weight': [5,10], 'learning_rate': [0.1,0.2], 'gamma': [0,3,5], 'subsample': [0.8,0.9] }

Decision Tree¶

Sample tuning method for Decision tree with original data

In [120]:
# defining model
Model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,6),
              'min_samples_leaf': [1, 4, 7], 
              'max_leaf_nodes' : [10,15],
              'min_impurity_decrease': [0.0001,0.001] }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 15, 'max_depth': 5} with CV score=0.5675998222560782:
In [43]:
dtree_tuned = DecisionTreeClassifier(
    min_samples_leaf = 7,
    min_impurity_decrease = 0.0001,
    max_leaf_nodes = 15,
    max_depth = 5
)

dtree_tuned.fit(X_train, y_train)
Out[43]:
DecisionTreeClassifier(max_depth=5, max_leaf_nodes=15,
                       min_impurity_decrease=0.0001, min_samples_leaf=7)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_depth=5, max_leaf_nodes=15,
                       min_impurity_decrease=0.0001, min_samples_leaf=7)
In [44]:
dtree_grid = model_performance_classification_sklearn(dtree_tuned, X_train, y_train)
dtree_grid
Out[44]:
Accuracy Recall Precision F1
0 0.974 0.593 0.904 0.717
In [45]:
dtree_grid_val = model_performance_classification_sklearn(dtree_tuned, X_val, y_val)
dtree_grid_val
Out[45]:
Accuracy Recall Precision F1
0 0.969 0.577 0.810 0.674
In [130]:
confusion_matrix_sklearn(dtree_tuned, X_val, y_val)

Sample tuning method for Decision tree with oversampled data

In [121]:
# defining model
Model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,6),
              'min_samples_leaf': [1, 4, 7], 
              'max_leaf_nodes' : [10,15],
              'min_impurity_decrease': [0.0001,0.001] }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.001, 'max_leaf_nodes': 15, 'max_depth': 3} with CV score=0.9143060712783726:
In [46]:
dtree_tuned_over = DecisionTreeClassifier(
    min_samples_leaf = 7,
    min_impurity_decrease = 0.001,
    max_leaf_nodes = 15,
    max_depth = 3
)

dtree_tuned_over.fit(X_train_over, y_train_over)
Out[46]:
DecisionTreeClassifier(max_depth=3, max_leaf_nodes=15,
                       min_impurity_decrease=0.001, min_samples_leaf=7)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_depth=3, max_leaf_nodes=15,
                       min_impurity_decrease=0.001, min_samples_leaf=7)
In [47]:
dtree_grid_over = model_performance_classification_sklearn(dtree_tuned_over, X_train_over, y_train_over)
dtree_grid_over
Out[47]:
Accuracy Recall Precision F1
0 0.838 0.917 0.792 0.850
In [48]:
dtree_grid_val_over = model_performance_classification_sklearn(dtree_tuned_over, X_val, y_val)
dtree_grid_val_over
Out[48]:
Accuracy Recall Precision F1
0 0.752 0.874 0.168 0.281
In [141]:
confusion_matrix_sklearn(dtree_tuned_over, X_val, y_val)

Sample tuning method for Decision tree with undersampled data

In [124]:
# defining model
Model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,20),
              'min_samples_leaf': [1, 2, 5, 7], 
              'max_leaf_nodes' : [5, 10,15],
              'min_impurity_decrease': [0.0001,0.001] }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'min_samples_leaf': 1, 'min_impurity_decrease': 0.001, 'max_leaf_nodes': 5, 'max_depth': 14} with CV score=0.8492287183393639:
In [49]:
dtree_tuned_under = DecisionTreeClassifier(
    min_samples_leaf = 1,
    min_impurity_decrease = 0.001,
    max_leaf_nodes = 5,
    max_depth = 14
)

dtree_tuned_under.fit(X_train_un, y_train_un)
Out[49]:
DecisionTreeClassifier(max_depth=14, max_leaf_nodes=5,
                       min_impurity_decrease=0.001)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_depth=14, max_leaf_nodes=5,
                       min_impurity_decrease=0.001)
In [50]:
dtree_grid_under = model_performance_classification_sklearn(dtree_tuned_under, X_train_un, y_train_un)
dtree_grid_under
Out[50]:
Accuracy Recall Precision F1
0 0.854 0.902 0.823 0.861
In [51]:
dtree_grid_val_under = model_performance_classification_sklearn(dtree_tuned_under, X_val, y_val)
dtree_grid_val_under
Out[51]:
Accuracy Recall Precision F1
0 0.768 0.878 0.178 0.296
In [149]:
confusion_matrix_sklearn(dtree_tuned_under, X_val, y_val)

Random Forest¶

Sample tuning method for Random Forest with original data

In [150]:
# defining model
Model = RandomForestClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = { "n_estimators": [200,250,300],
              "min_samples_leaf": np.arange(1, 4), 
              "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'], 
              "max_samples": np.arange(0.4, 0.7, 0.1) 
              }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 300, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.7038786262934045:
In [52]:
rf_tuned = RandomForestClassifier(
    n_estimators = 300,
    min_samples_leaf = 1,
    max_samples = 0.6,
    max_features = 'sqrt'
    )

rf_tuned.fit(X_train, y_train)
Out[52]:
RandomForestClassifier(max_samples=0.6, n_estimators=300)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(max_samples=0.6, n_estimators=300)
In [53]:
rf_grid = model_performance_classification_sklearn(rf_tuned, X_train, y_train)
rf_grid
Out[53]:
Accuracy Recall Precision F1
0 0.995 0.909 1.000 0.952
In [54]:
rf_grid_val = model_performance_classification_sklearn(rf_tuned, X_val, y_val)
rf_grid_val
Out[54]:
Accuracy Recall Precision F1
0 0.985 0.734 0.982 0.840
In [23]:
confusion_matrix_sklearn(rf_tuned, X_val, y_val)

Sample tuning method for Random Forest with over sampled data

In [151]:
# defining model
Model = RandomForestClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = { "n_estimators": [200,250,300],
              "min_samples_leaf": np.arange(1, 4), 
              "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'], 
              "max_samples": np.arange(0.4, 0.7, 0.1) 
              }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 300, 'min_samples_leaf': 1, 'max_samples': 0.6, 'max_features': 'sqrt'} with CV score=0.9808099737442019:
In [55]:
rf_tuned_over = RandomForestClassifier(
    n_estimators = 300,
    min_samples_leaf = 1,
    max_samples = 0.6,
    max_features = 'sqrt'
    )

rf_tuned_over.fit(X_train_over, y_train_over)
Out[55]:
RandomForestClassifier(max_samples=0.6, n_estimators=300)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(max_samples=0.6, n_estimators=300)
In [56]:
rf_grid_over = model_performance_classification_sklearn(rf_tuned_over, X_train, y_train)
rf_grid_over
Out[56]:
Accuracy Recall Precision F1
0 1.000 1.000 0.999 0.999
In [57]:
rf_grid_val_over = model_performance_classification_sklearn(rf_tuned_over, X_val, y_val)
rf_grid_val_over
Out[57]:
Accuracy Recall Precision F1
0 0.988 0.856 0.918 0.886
In [26]:
confusion_matrix_sklearn(rf_tuned_over, X_val, y_val)

Sample tuning method for Random Forest with under data

In [155]:
# defining model
Model = RandomForestClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = { "n_estimators": [200,250,300],
              "min_samples_leaf": np.arange(1, 4), 
              "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'], 
              "max_samples": np.arange(0.4, 0.7, 0.1) 
              }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 250, 'min_samples_leaf': 2, 'max_samples': 0.5, 'max_features': 'sqrt'} with CV score=0.8941979305529106:
In [58]:
rf_tuned_under = RandomForestClassifier(
    n_estimators = 250,
    min_samples_leaf = 2,
    max_samples = 0.5,
    max_features = 'sqrt'
    )

rf_tuned_under.fit(X_train_un, y_train_un)
Out[58]:
RandomForestClassifier(max_samples=0.5, min_samples_leaf=2, n_estimators=250)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(max_samples=0.5, min_samples_leaf=2, n_estimators=250)
In [59]:
rf_grid_under = model_performance_classification_sklearn(rf_tuned_under, X_train_un, y_train_un)
rf_grid_under
Out[59]:
Accuracy Recall Precision F1
0 0.962 0.931 0.993 0.961
In [60]:
rf_grid_val_under = model_performance_classification_sklearn(rf_tuned_under, X_val, y_val)
rf_grid_val_under
Out[60]:
Accuracy Recall Precision F1
0 0.935 0.883 0.457 0.602
In [32]:
confusion_matrix_sklearn(rf_tuned_under, X_val, y_val)

Bagging¶

Sample tuning method for Bagging with original data

In [27]:
%%time

# defining model
Model = BaggingClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = { 'max_samples': [0.8,0.9,1], 
              'max_features': [0.7,0.8,0.9], 
              'n_estimators' : [30,50,70], }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 30, 'max_samples': 0.9, 'max_features': 0.9} with CV score=0.728648511394655:
CPU times: user 37.7 s, sys: 258 ms, total: 38 s
Wall time: 21min 33s
In [35]:
bagging_tuned = BaggingClassifier(
    n_estimators = 30,
    max_samples = 0.9,
    max_features = 0.9  
)

bagging_tuned.fit(X_train, y_train)
Out[35]:
BaggingClassifier(max_features=0.9, max_samples=0.9, n_estimators=30)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
BaggingClassifier(max_features=0.9, max_samples=0.9, n_estimators=30)
In [41]:
bagging_grid = model_performance_classification_sklearn(bagging_tuned, X_train, y_train)
bagging_grid
Out[41]:
Accuracy Recall Precision F1
0 0.999 0.973 1.000 0.986
In [42]:
bagging_grid_val = model_performance_classification_sklearn(bagging_tuned, X_val, y_val)
bagging_grid_val
Out[42]:
Accuracy Recall Precision F1
0 0.984 0.739 0.965 0.837
In [38]:
confusion_matrix_sklearn(bagging_tuned, X_val, y_val)

Sample tuning method for Bagging with over sampled data

In [30]:
%%time

# defining model
Model = BaggingClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = { 'max_samples': [0.8,0.9,1], 
              'max_features': [0.7,0.8,0.9], 
              'n_estimators' : [30,50,70], }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 70, 'max_samples': 0.9, 'max_features': 0.9} with CV score=0.9835892615034132:
CPU times: user 1min 58s, sys: 1.11 s, total: 1min 59s
Wall time: 30min 41s
In [39]:
bagging_tuned_over = BaggingClassifier(
    n_estimators = 70,
    max_samples = 0.9,
    max_features = 0.9
)

bagging_tuned_over.fit(X_train_over, y_train_over)
Out[39]:
BaggingClassifier(max_features=0.9, max_samples=0.9, n_estimators=70)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
BaggingClassifier(max_features=0.9, max_samples=0.9, n_estimators=70)
In [61]:
bagging_grid_over = model_performance_classification_sklearn(bagging_tuned_over, X_train_over, y_train_over)
bagging_grid_over
Out[61]:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
In [62]:
bagging_grid_val_over = model_performance_classification_sklearn(bagging_tuned_over, X_val, y_val)
bagging_grid_val_over
Out[62]:
Accuracy Recall Precision F1
0 0.984 0.860 0.857 0.858
In [42]:
confusion_matrix_sklearn(bagging_tuned_over, X_val, y_val)

Sample tuning method for Bagging with under sampled data

In [31]:
%%time

# defining model
Model = BaggingClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = { 'max_samples': [0.8,0.9,1], 
              'max_features': [0.7,0.8,0.9], 
              'n_estimators' : [30,50,70], }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 70, 'max_samples': 0.8, 'max_features': 0.7} with CV score=0.8953215260585285:
CPU times: user 2.79 s, sys: 62 ms, total: 2.85 s
Wall time: 1min 3s
In [63]:
bagging_tuned_under = BaggingClassifier(
    n_estimators = 70,
    max_samples = 0.8,
    max_features = 0.7
)

bagging_tuned_under.fit(X_train_un, y_train_un)
Out[63]:
BaggingClassifier(max_features=0.7, max_samples=0.8, n_estimators=70)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
BaggingClassifier(max_features=0.7, max_samples=0.8, n_estimators=70)
In [64]:
bagging_grid_under = model_performance_classification_sklearn(bagging_tuned_under, X_train_un, y_train_un)
bagging_grid_under
Out[64]:
Accuracy Recall Precision F1
0 0.997 0.994 0.999 0.997
In [65]:
bagging_grid_val_under = model_performance_classification_sklearn(bagging_tuned_under, X_val, y_val)
bagging_grid_val_under
Out[65]:
Accuracy Recall Precision F1
0 0.943 0.896 0.491 0.635
In [66]:
confusion_matrix_sklearn(bagging_tuned_under, X_val, y_val)

GradientBoost¶

Sample tuning method for GradientBoost with original data

In [32]:
%%time

# defining model
Model = GradientBoostingClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = { "n_estimators": np.arange(100,150,25), 
              "learning_rate": [0.2, 0.05, 1],
              "subsample":[0.5,0.7], 
              "max_features":[0.5,0.7]
              }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'n_estimators': 125, 'max_features': 0.5, 'learning_rate': 0.2} with CV score=0.7602678854821303:
CPU times: user 15.3 s, sys: 188 ms, total: 15.5 s
Wall time: 5min 59s
In [67]:
gboost_tuned = GradientBoostingClassifier(
    subsample = 0.7,
    n_estimators = 125,
    max_features = 0.5,
    learning_rate = 0.2
)

gboost_tuned.fit(X_train, y_train)
Out[67]:
GradientBoostingClassifier(learning_rate=0.2, max_features=0.5,
                           n_estimators=125, subsample=0.7)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(learning_rate=0.2, max_features=0.5,
                           n_estimators=125, subsample=0.7)
In [68]:
gboost_grid = model_performance_classification_sklearn(gboost_tuned, X_train, y_train)
gboost_grid
Out[68]:
Accuracy Recall Precision F1
0 0.994 0.912 0.978 0.944
In [69]:
gboost_grid_val = model_performance_classification_sklearn(gboost_tuned, X_val, y_val)
gboost_grid_val
Out[69]:
Accuracy Recall Precision F1
0 0.978 0.766 0.829 0.796
In [70]:
confusion_matrix_sklearn(gboost_tuned, X_val, y_val)

Sample tuning method for GradientBoost with over sampled data

In [33]:
%%time

# defining model
Model = GradientBoostingClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = { "n_estimators": np.arange(100,150,25), 
              "learning_rate": [0.2, 0.05, 1],
              "subsample":[0.5,0.7], 
              "max_features":[0.5,0.7]
              }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'n_estimators': 125, 'max_features': 0.5, 'learning_rate': 1} with CV score=0.9677077328831046:
CPU times: user 30.1 s, sys: 451 ms, total: 30.5 s
Wall time: 11min 48s
In [71]:
gboost_tuned_over = GradientBoostingClassifier(
    subsample = 0.7,
    n_estimators = 125,
    max_features = 0.5,
    learning_rate = 1
)

gboost_tuned_over.fit(X_train_over, y_train_over)
Out[71]:
GradientBoostingClassifier(learning_rate=1, max_features=0.5, n_estimators=125,
                           subsample=0.7)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(learning_rate=1, max_features=0.5, n_estimators=125,
                           subsample=0.7)
In [72]:
gboost_grid_over = model_performance_classification_sklearn(gboost_tuned_over, X_train_over, y_train_over)
gboost_grid_over
Out[72]:
Accuracy Recall Precision F1
0 0.993 0.992 0.993 0.993
In [73]:
gboost_grid_val_over = model_performance_classification_sklearn(gboost_tuned_over, X_val, y_val)
gboost_grid_val_over
Out[73]:
Accuracy Recall Precision F1
0 0.969 0.842 0.678 0.751
In [74]:
confusion_matrix_sklearn(gboost_tuned_over, X_val, y_val)

Sample tuning method for GradientBoost with under sampled data

In [75]:
%%time

# defining model
Model = GradientBoostingClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = { "n_estimators": np.arange(100,150,25), 
              "learning_rate": [0.2, 0.05, 1],
              "subsample":[0.5,0.7], 
              "max_features":[0.5,0.7]
              }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'n_estimators': 125, 'max_features': 0.5, 'learning_rate': 0.2} with CV score=0.9031993905922681:
CPU times: user 1.59 s, sys: 66.8 ms, total: 1.66 s
Wall time: 42.3 s
In [76]:
gboost_tuned_under = GradientBoostingClassifier(
    subsample = 0.7,
    n_estimators = 125,
    max_features = 0.5,
    learning_rate = 0.2
)

gboost_tuned_under.fit(X_train_un, y_train_un)
Out[76]:
GradientBoostingClassifier(learning_rate=0.2, max_features=0.5,
                           n_estimators=125, subsample=0.7)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(learning_rate=0.2, max_features=0.5,
                           n_estimators=125, subsample=0.7)
In [77]:
gboost_grid_under = model_performance_classification_sklearn(gboost_tuned_under, X_train_un, y_train_un)
gboost_grid_under
Out[77]:
Accuracy Recall Precision F1
0 0.995 0.992 0.998 0.995
In [78]:
gboost_grid_val_under = model_performance_classification_sklearn(gboost_tuned_under, X_val, y_val)
gboost_grid_val_under
Out[78]:
Accuracy Recall Precision F1
0 0.935 0.883 0.456 0.601
In [63]:
confusion_matrix_sklearn(gboost_tuned_under, X_val, y_val)

XGBoost¶

Sample tuning method for XGBoost with original data

In [79]:
%%time

# defining model
Model = XGBClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid={ 'n_estimators': [150, 200, 250],
            'scale_pos_weight': [5,10],
            'learning_rate': [0.1,0.2],
            'gamma': [0,3,5],
            'subsample': [0.8,0.9]
            }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.8, 'scale_pos_weight': 10, 'n_estimators': 200, 'learning_rate': 0.1, 'gamma': 5} with CV score=0.8536469243953533:
CPU times: user 53.2 s, sys: 862 ms, total: 54 s
Wall time: 19min 23s
In [82]:
xgboost_tuned = XGBClassifier(
    subsample = 0.8,
    scale_pos_weight = 10,
    n_estimators = 200,
    learning_rate = 0.1,
    gamma = 5
)

xgboost_tuned.fit(X_train, y_train)
Out[82]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=5, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.1, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=200, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=5, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.1, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=200, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...)
In [83]:
xgboost_grid = model_performance_classification_sklearn(xgboost_tuned, X_train, y_train)
xgboost_grid
Out[83]:
Accuracy Recall Precision F1
0 0.999 1.000 0.987 0.993
In [84]:
xgboost_grid_val = model_performance_classification_sklearn(xgboost_tuned, X_val, y_val)
xgboost_grid_val
Out[84]:
Accuracy Recall Precision F1
0 0.989 0.856 0.931 0.892
In [85]:
confusion_matrix_sklearn(xgboost_tuned, X_val, y_val)

Sample tuning method for XGBoost with over sampled data

In [26]:
%%time

# defining model
Model = XGBClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid={ 'n_estimators': [150, 200, 250],
            'scale_pos_weight': [5,10],
            'learning_rate': [0.1,0.2],
            'gamma': [0,3,5],
            'subsample': [0.8,0.9]
            }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.8, 'scale_pos_weight': 10, 'n_estimators': 200, 'learning_rate': 0.1, 'gamma': 5} with CV score=0.9958972606443475:
CPU times: user 1min 41s, sys: 2.25 s, total: 1min 43s
Wall time: 37min 8s
In [29]:
xgboost_tuned_over = XGBClassifier(
    subsample = 0.8,
    scale_pos_weight = 10,
    n_estimators = 200,
    learning_rate = 0.1, 
    gamma = 5
)

xgboost_tuned_over.fit(X_train_over, y_train_over)
Out[29]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=5, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.1, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=200, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=5, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.1, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=200, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...)
In [80]:
xgboost_grid_over = model_performance_classification_sklearn(xgboost_tuned_over, X_train_over, y_train_over)
xgboost_grid_over
Out[80]:
Accuracy Recall Precision F1
0 0.996 1.000 0.993 0.996
In [81]:
xgboost_grid_val_over = model_performance_classification_sklearn(xgboost_tuned_over, X_val, y_val)
xgboost_grid_val_over
Out[81]:
Accuracy Recall Precision F1
0 0.974 0.878 0.712 0.786
In [73]:
confusion_matrix_sklearn(xgboost_tuned_over, X_val, y_val)

Sample tuning method for XGBoost with under sampled data

In [20]:
%%time

# defining model
Model = XGBClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid={ 'n_estimators': [150, 200, 250],
            'scale_pos_weight': [5,10],
            'learning_rate': [0.1,0.2],
            'gamma': [0,3,5],
            'subsample': [0.8,0.9]
            }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'scale_pos_weight': 10, 'n_estimators': 200, 'learning_rate': 0.1, 'gamma': 5} with CV score=0.9223386021710149:
CPU times: user 6.19 s, sys: 152 ms, total: 6.34 s
Wall time: 2min 33s
In [21]:
xgboost_tuned_under = XGBClassifier(
    subsample = 0.9,
    scale_pos_weight = 10,
    n_estimators = 200,
    learning_rate = 0.1,
    gamma = 5
)

xgboost_tuned_under.fit(X_train_un, y_train_un)
Out[21]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=5, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.1, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=200, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=5, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.1, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=200, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...)
In [86]:
xgboost_grid_under = model_performance_classification_sklearn(xgboost_tuned_under, X_train_un, y_train_un)
xgboost_grid_under
Out[86]:
Accuracy Recall Precision F1
0 0.995 1.000 0.990 0.995
In [87]:
xgboost_grid_val_under = model_performance_classification_sklearn(xgboost_tuned_under, X_val, y_val)
xgboost_grid_val_under
Out[87]:
Accuracy Recall Precision F1
0 0.869 0.923 0.287 0.438
In [25]:
confusion_matrix_sklearn(xgboost_tuned_under, X_val, y_val)

Model performance comparison and choosing the final model¶

In [89]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        dtree_grid.T,
        dtree_grid_over.T,
        dtree_grid_under.T,
        rf_grid.T,
        rf_grid_over.T,
        rf_grid_under.T,
        bagging_grid.T,
        bagging_grid_over.T,
        bagging_grid_under.T,
        gboost_grid.T,
        gboost_grid_over.T,
        gboost_grid_under.T,
        xgboost_grid.T,
        xgboost_grid_over.T,
        xgboost_grid_under.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree Tuned with original data",
    "Decision Tree Tuned with over sampled data",
    "Decision Tree Tuned with under sampled data",
    "Random Forest Tuned with original data",
    "Random Forest Tuned with over sampled data",
    "Random Forest Tuned with under sampled data",
    "Bagging Tuned with original data",
    "Bagging Tuned with over sampled data",
    "Bagging Tuned with under sampled data",
    "Gradient Boost Tuned with original data",
    "Gradient Boost Tuned with over sampled data",
    "Gradient Boost Tuned with under sampled data",
    "XGBoost Tuned with original data",
    "XGBoost Tuned with over sampled data",
    "XGBoost Tuned with under sampled data",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[89]:
Decision Tree Tuned with original data Decision Tree Tuned with over sampled data Decision Tree Tuned with under sampled data Random Forest Tuned with original data Random Forest Tuned with over sampled data Random Forest Tuned with under sampled data Bagging Tuned with original data Bagging Tuned with over sampled data Bagging Tuned with under sampled data Gradient Boost Tuned with original data Gradient Boost Tuned with over sampled data Gradient Boost Tuned with under sampled data XGBoost Tuned with original data XGBoost Tuned with over sampled data XGBoost Tuned with under sampled data
Accuracy 0.974 0.838 0.854 0.995 1.000 0.962 0.999 1.000 0.997 0.994 0.993 0.995 0.999 0.996 0.995
Recall 0.593 0.917 0.902 0.909 1.000 0.931 0.973 1.000 0.994 0.912 0.992 0.992 1.000 1.000 1.000
Precision 0.904 0.792 0.823 1.000 0.999 0.993 1.000 1.000 0.999 0.978 0.993 0.998 0.987 0.993 0.990
F1 0.717 0.850 0.861 0.952 0.999 0.961 0.986 1.000 0.997 0.944 0.993 0.995 0.993 0.996 0.995
In [90]:
# Validation performance comparison

models_val_comp_df = pd.concat(
    [
        dtree_grid_val.T,
        dtree_grid_val_over.T,
        dtree_grid_val_under.T,
        rf_grid_val.T,
        rf_grid_val_over.T,
        rf_grid_val_under.T,
        bagging_grid_val.T,
        bagging_grid_val_over.T,
        bagging_grid_val_under.T,
        gboost_grid_val.T,
        gboost_grid_val_over.T,
        gboost_grid_val_under.T,
        xgboost_grid_val.T,
        xgboost_grid_val_over.T,
        xgboost_grid_val_under.T,
    ],
    axis=1,
)
models_val_comp_df.columns = [
    "Decision Tree Tuned with original data",
    "Decision Tree Tuned with over sampled data",
    "Decision Tree Tuned with under sampled data",
    "Random Forest Tuned with original data",
    "Random Forest Tuned with over sampled data",
    "Random Forest Tuned with under sampled data",
    "Bagging Tuned with original data",
    "Bagging Tuned with over sampled data",
    "Bagging Tuned with under sampled data",
    "Gradient Boost Tuned with original data",
    "Gradient Boost Tuned with over sampled data",
    "Gradient Boost Tuned with under sampled data",
    "XGBoost Tuned with original data",
    "XGBoost Tuned with over sampled data",
    "XGBoost Tuned with under sampled data"
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
Out[90]:
Decision Tree Tuned with original data Decision Tree Tuned with over sampled data Decision Tree Tuned with under sampled data Random Forest Tuned with original data Random Forest Tuned with over sampled data Random Forest Tuned with under sampled data Bagging Tuned with original data Bagging Tuned with over sampled data Bagging Tuned with under sampled data Gradient Boost Tuned with original data Gradient Boost Tuned with over sampled data Gradient Boost Tuned with under sampled data XGBoost Tuned with original data XGBoost Tuned with over sampled data XGBoost Tuned with under sampled data
Accuracy 0.969 0.752 0.768 0.985 0.988 0.935 0.984 0.984 0.943 0.978 0.969 0.935 0.989 0.974 0.869
Recall 0.577 0.874 0.878 0.734 0.856 0.883 0.739 0.860 0.896 0.766 0.842 0.883 0.856 0.878 0.923
Precision 0.810 0.168 0.178 0.982 0.918 0.457 0.965 0.857 0.491 0.829 0.678 0.456 0.931 0.712 0.287
F1 0.674 0.281 0.296 0.840 0.886 0.602 0.837 0.858 0.635 0.796 0.751 0.601 0.892 0.786 0.438

From comparing the performance of the models we prioritize recall since unpredicted failures are the most expensive scenario. We are picking a final model based on recall performance and F1 score while comparing training to validation set. The model that performs well and may be best for production is Gradient Boost with over sampling as the metrics are not showing major signs of overfitting and the recall and F1 scores are high.

Test set final performance¶

In [95]:
#Testing best model on testing data set
gboost_over_test = model_performance_classification_sklearn(gboost_tuned_over, X_test, y_test)
print("Test Performance:")
gboost_over_test
Test Performance:
Out[95]:
Accuracy Recall Precision F1
0 0.965 0.826 0.645 0.725
In [96]:
confusion_matrix_sklearn(gboost_tuned_over, X_test, y_test)

The test data was used on the final model and produced a recall of approximately 0.83 and a F1 score of approximately 0.73. The confusion matrix shown above also indicates good performance, as the unpredicted failures are less than 1%.

In [97]:
feature_names = X.columns
importances = gboost_tuned_over.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="green", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

When analyzing the key predictive features of the final model, V39 and V18 are high in importance. V34, V26 and V11 are also mentionable variables that should be monitored.

Pipelines to build the final model¶

In [19]:
#creating list of variables
features = [
    "V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10",
    "V11", "V12", "V13", "V14", "V15", "V16", "V17", "V18", "V19", "V20",
    "V21", "V22", "V23", "V24", "V25", "V26", "V27", "V28", "V29", "V30",
    "V31", "V32", "V33", "V34", "V35", "V36", "V37", "V38", "V39", "V40"
]

#transformer for missing values to replace with median
numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])

preprocessor = ColumnTransformer(
    transformers=[
        ("variables", numeric_transformer, features)],
    remainder="passthrough",
)
In [9]:
#Separate X and y
X = df1.drop("Target", axis=1)
y = df1["Target"]
In [10]:
# Splitting data into train and test 80:20
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
(16000, 40) (4000, 40)
In [11]:
#Checking target ratio in train and test
print(y_train.value_counts() / y_train.count())
print("-" * 30)
print(y_test.value_counts() / y_test.count())
0   0.945
1   0.056
Name: Target, dtype: float64
------------------------------
0   0.945
1   0.056
Name: Target, dtype: float64
In [20]:
#Pipeline with best found model
model = Pipeline(
    steps=[
        ("pre", preprocessor),
        (
          "GBM",
          GradientBoostingClassifier(
          subsample = 0.7,
          n_estimators = 125,
          max_features = 0.5,
          learning_rate = 1
          ),
          ),
    ]
)

# Fit the model on training data
model.fit(X_train, y_train)
Out[20]:
Pipeline(steps=[('pre',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('variables',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median'))]),
                                                  ['V1', 'V2', 'V3', 'V4', 'V5',
                                                   'V6', 'V7', 'V8', 'V9',
                                                   'V10', 'V11', 'V12', 'V13',
                                                   'V14', 'V15', 'V16', 'V17',
                                                   'V18', 'V19', 'V20', 'V21',
                                                   'V22', 'V23', 'V24', 'V25',
                                                   'V26', 'V27', 'V28', 'V29',
                                                   'V30', ...])])),
                ('GBM',
                 GradientBoostingClassifier(learning_rate=1, max_features=0.5,
                                            n_estimators=125, subsample=0.7))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('pre',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('variables',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median'))]),
                                                  ['V1', 'V2', 'V3', 'V4', 'V5',
                                                   'V6', 'V7', 'V8', 'V9',
                                                   'V10', 'V11', 'V12', 'V13',
                                                   'V14', 'V15', 'V16', 'V17',
                                                   'V18', 'V19', 'V20', 'V21',
                                                   'V22', 'V23', 'V24', 'V25',
                                                   'V26', 'V27', 'V28', 'V29',
                                                   'V30', ...])])),
                ('GBM',
                 GradientBoostingClassifier(learning_rate=1, max_features=0.5,
                                            n_estimators=125, subsample=0.7))])
ColumnTransformer(remainder='passthrough',
                  transformers=[('variables',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median'))]),
                                 ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7',
                                  'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14',
                                  'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
                                  'V21', 'V22', 'V23', 'V24', 'V25', 'V26',
                                  'V27', 'V28', 'V29', 'V30', ...])])
['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29', 'V30', 'V31', 'V32', 'V33', 'V34', 'V35', 'V36', 'V37', 'V38', 'V39', 'V40']
SimpleImputer(strategy='median')
[]
passthrough
GradientBoostingClassifier(learning_rate=1, max_features=0.5, n_estimators=125,
                           subsample=0.7)
In [21]:
#checking performance on test data
model_test = model_performance_classification_sklearn(model, X_test, y_test)
model_test
Out[21]:
Accuracy Recall Precision F1
0 0.973 0.721 0.773 0.746

Business Insights and Conclusions¶

From our analysis, we have built and tested various models and have chosen to move forward with a GradientBoost Classifier built with oversampled data. The model was chosen for it's performance in recall (since this is given more importance for the cost of an unpredicted failure) and comparable metrics from training to testing sets.

EDA showed that most of the collected data resembled a normal distribution.

From the model, we have found that V39 and V18 are key factors in failure prediction. V34, V26 and V11 are also notable factors with prediction in failure. Due to confidentiality, there is missing context, but recommendation would be to explore these factors of degredation. Monitor these factors specifically and make repairs when needed to avoid the cost of having to replace equipment.

As well as a model being built, a pipeline has been constructed for production for easy analysis going forward.


In [ ]: